Indexable PLA for Efficient Similarity Search
Authors
Abstract
Similarity search over time-series databases has long been an active research topic, with applications in multimedia retrieval, data mining, web search and retrieval, and other areas. However, because of the high dimensionality (i.e., length) of time series, similarity search over directly indexed time series runs into a serious problem known as the “dimensionality curse”. Many dimensionality reduction techniques have therefore been proposed to break this curse by reducing the dimensionality of the time series. Among the proposed methods, only Piecewise Linear Approximation (PLA) lacks an indexing mechanism to support similarity queries, which prevents it from searching efficiently over very large time-series databases. Our initial studies on the effectiveness of different reduction methods, however, show that PLA performs no worse than the others. Motivated by this, we re-investigate PLA for approximating and indexing time series. Specifically, we propose a novel distance function in the reduced PLA space and prove that it is a lower bound of the Euclidean distance between the original time series, which guarantees no false dismissals during similarity search. As a second step, we develop an effective approach to index these lower bounds to improve search efficiency. Extensive experiments over a wide spectrum of real and synthetic data sets demonstrate the efficiency and effectiveness of PLA together with the newly proposed lower-bound distance, in terms of both pruning power and wall-clock time, compared with two state-of-the-art reduction methods, Adaptive Piecewise Constant Approximation (APCA) and Chebyshev Polynomials (CP).
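To make the lower-bounding idea concrete, here is a minimal Python sketch. It is not the distance function proposed in the paper, whose exact form the abstract does not give. It assumes both series have the same length and are cut into the same consecutive segments, each fitted with a least-squares line. Under that assumption the PLA reconstruction is an orthogonal projection, so the Euclidean distance between two reconstructions can never exceed the distance between the original series. All function names (`fit_segment`, `pla`, `reconstruct`, `pla_lower_bound`) are hypothetical and chosen only for illustration.

```python
import math
import random


def fit_segment(values):
    """Least-squares line a*t + b over t = 0..len(values)-1; returns (a, b)."""
    n = len(values)
    if n == 1:
        return 0.0, float(values[0])
    t_mean = (n - 1) / 2.0
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    a = num / den
    return a, v_mean - a * t_mean


def pla(series, seg_len):
    """Cut the series into consecutive segments of seg_len points
    (the last one may be shorter) and fit one line per segment."""
    return [fit_segment(series[i:i + seg_len])
            for i in range(0, len(series), seg_len)]


def reconstruct(coeffs, n, seg_len):
    """Rebuild an n-point approximation from the per-segment (a, b) pairs."""
    out = []
    for k, (a, b) in enumerate(coeffs):
        length = min(seg_len, n - k * seg_len)
        out.extend(a * t + b for t in range(length))
    return out


def euclidean(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def pla_lower_bound(cx, cy, n, seg_len):
    """Distance between the two PLA reconstructions. With identical
    segmentation and least-squares fits this never exceeds the true
    Euclidean distance (illustrative bound, not the paper's formula)."""
    return euclidean(reconstruct(cx, n, seg_len), reconstruct(cy, n, seg_len))


# Small demo on random data: the bound holds up to rounding error.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(64)]
y = [random.gauss(0, 1) for _ in range(64)]
lb = pla_lower_bound(pla(x, 8), pla(y, 8), 64, 8)
assert lb <= euclidean(x, y) + 1e-9
```

In a search setting, a candidate whose lower bound already exceeds the query threshold can be discarded without touching the raw series, which is why a valid lower bound produces no false dismissals.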
Related resources
Modelling Peer-to-Peer Data Networks Under Complex System Theory
A Peer-to-peer Data Network (PDN) is an open and evolving society of peer nodes that assemble into a network to share their data for mutual benefit. PDNs are enabled by distributed query processing. We argue that with a self-organizing, dynamic, and large-scale architecture and nodes that inherit extensive autonomy from their human users, this new generation of distributed database systems shou...
Measuring and Modeling the Web a Dissertation Submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
The last couple of decades have witnessed a phenomenal growth in the World Wide Web. The Web has now become a ubiquitous channel for information sharing and dissemination. This has created a whole new set of research challenges. This thesis describes several research contributions in an endeavor towards a better understanding of the Web. We focus on two major topics: (1) measuring the size of t...
Methods for the Efficient Discovery of Large Item-Indexable Sequential Patterns
An increasingly relevant set of tasks, such as the discovery of biclusters with order-preserving properties, can be mapped as a sequential pattern mining problem on data with item-indexable properties. An item-indexable database, typically observed in biomedical domains, does not allow item repetitions per sequence and is commonly dense. Although multiple methods have been proposed for the effi...
An improved opposition-based Crow Search Algorithm for Data Clustering
Data clustering is an ideal way of working with a huge amount of data and looking for a structure in the dataset. In other words, clustering is the classification of the same data; the similarity among the data in a cluster is maximum and the similarity among the data in the different clusters is minimal. The innovation of this paper is a clustering method based on the Crow Search Algorithm (CS...
PrivatePond: Outsourced Management of Web Corpuses
With the rise of cloud computing, it is increasingly attractive for end-users (organizations and individuals) to outsource the management of their data to a small number of large-scale service providers. In this paper, we consider a user who wants to outsource storage and search for a corpus of web documents (e.g., an intranet). At the same time, the corpus may contain confidential documents tha...